B.  INTRODUCTION

Interval-censored  (IC)  data  are  encountered  in  three  areas  of  breast  cancer  research. 
The  most  common  application  is  in  clinical  relapse  follow-up  studies  in  which  the  study 
endpoint  is  disease-free  survival.  When  a  patient  relapses,  it  is  usually  known  that  the 
relapse  takes  place  between  two  follow-up  visits,  and  the  exact  time  to  relapse  is  unknown. 
In  statistics,  we  say  relapse  time  is  interval  censored.  Interval  censoring  is  also  encountered 
in  breast  cancer  registry  studies  in  which  information  on  family  history  of  cancer  is  updated 
periodically.  The  Strang  Breast  Surveillance  Program  for  women  at  increased  risk  for  breast 
cancer,  for  instance,  has  enlisted  over  800  women  with  complete  pedigree  information  which 
is  verified  and  updated  continuously.  Family  history  data  such  as  age  at  diagnosis  of  a 
specific  cancer,  or  a  benign  but  risk-conferring  condition,  are  obtained  from  each  registrant 
at  each  update.  Time  to  a  cancer  event,  and  definitely  time  to  first  detection  of  a  benign 
condition,  are  at  best  known  to  fall  in  the  time  interval  between  the  last  update  and  age 
at  diagnosis.  A  third  but  increasingly  important  area  of  application  of  interval  censoring 
is  in  breast  cancer  chemoprevention  experiments  or  prevention  trials,  which  involve  the 
observation  of  one  or  more  surrogate  endpoint  biomarkers  (SEB)  over  time.  The  scientific 
question  of  interest  here  is  the  estimation  of  time  for  the  SEB  to  reach  a  target  value, 
and  time  from  cessation  of  intake  of  a  chemopreventive  agent  to  the  loss  of  its  protective 
effect. Unfortunately, the exact values of both these time variables are known only to lie
between two successive assay inspection times. In a breast cancer follow-up study, we will
also often encounter covariates (for instance, tumor size and nodal status in a relapse study, and
baseline SEB value in a chemoprevention trial).

Let $X$ denote a time-to-event variable with distribution function $F(x) = \Pr(X \le x)$ or,
equivalently, survival function $S(x) = 1 - F(x)$. In interval censoring, $X$ is not observed and
is known only to lie in an observable interval $(L, R)$. In our previous DOD-funded grant,
we made fundamental contributions to both the theory of generalized maximum
likelihood (GML) estimation of $S$ and the computation required for inference based on the
GML estimator (GMLE) $\hat{S}$ of $S$. These contributions are restricted to the case of univariate
interval-censored data without covariates.

The Cox proportional hazards regression model [1] specifies that covariates have a
proportional effect on the hazard function of $X$. This model provides a powerful means of
fitting failure time observations to a distribution-free model and of estimating the risk of
failure associated with a vector of covariates. It is used extensively for right-censored data.
Finkelstein [2] applied the Cox model to the analysis of interval-censored data. However, she
did not establish asymptotic properties of the GMLE of the parameters in the model, and
the approach is limited to small sample sizes because of computational difficulty.

Our interest in IC data with covariates is driven by needs arising from two related
areas of breast cancer research at Strang. First, our investigators in the Strang Cancer
Genetics Program want to study various patterns of familial aggregation of breast, ovarian
and other forms of cancer using family history data from the Strang Breast Surveillance
Program. Studies of familial early onset of breast cancer, and of breast-ovarian and breast-prostate
associations, will lead to IC data with covariates; a proper statistical procedure,
together with practical software for handling such data, is therefore very much needed. Second,
we conducted a one-year chemoprevention trial of indole-3-carbinol (I3C) for breast cancer
prevention. In this prevention trial we monitored the levels of two SEBs, a urinary estrogen
metabolite ratio and a blood counterpart, both of which are subject to interval censoring.
An earlier dose-ranging study of I3C conducted by Wong et al. [3] has been published.

The overall aim of this research proposal is to develop statistical inference for
interval-censored data with covariates that are encountered in breast cancer chemoprevention trials
employing surrogate endpoint biomarkers, and in breast cancer registry follow-up studies of
familial aggregation of breast and other forms of cancer. Asymptotic generalized maximum
likelihood theory under the Cox regression model will be investigated, and a computer software
package for maximum likelihood inference will be implemented.

C.  BODY 

C.1.  Model  Formulation  and  Likelihood  Equations.

Let $Y_{K,1} < Y_{K,2} < \cdots < Y_{K,K}$ denote the follow-up times for a patient who has made
$K$ follow-up visits in a longitudinal follow-up study. Since the number of visits may vary
from patient to patient, $K$ is a random positive integer. For convenience, define $Y_{K,0} = 0$ and
$Y_{K,K+1} = \infty$. The time-to-event variable of interest, $X$, is not directly observed; instead, it
is known to lie between two successive censoring time points $Y_{K,j}$ and $Y_{K,j+1}$, where $j = 0,
\dots, K$. Note that $X$ is left censored if $j = 0$, strictly interval censored if $0 < j < K$, and
right censored if $X > Y_{K,K}$. The observable interval-censored data corresponding to $X$ are
given by

$$(L, R) = (Y_{K,j}, Y_{K,j+1}) \quad \text{if } Y_{K,j} < X \le Y_{K,j+1}, \quad j = 0, 1, \dots, K. \tag{1}$$
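
As an illustration of the censoring mechanism in (1), the following minimal Python sketch maps a hypothetical event time and a patient's visit times to the observable interval. The function name `censoring_interval` and the half-open convention L < x <= R are assumptions made for this sketch; it is not taken from the software described in this report.

```python
import bisect
import math


def censoring_interval(visit_times, x):
    """Return (L, R) with L < x <= R for an unobserved event time x > 0.

    visit_times holds the K follow-up times; the grid is padded with
    Y_0 = 0 and Y_{K+1} = infinity, as in the text.
    """
    if x <= 0:
        raise ValueError("event time must be positive")
    grid = [0.0] + sorted(visit_times) + [math.inf]
    # Index of the first grid point that is >= x; the interval ends there.
    hi = bisect.bisect_left(grid, x)
    return grid[hi - 1], grid[hi]


visits = [1.0, 2.0, 3.0]
left_censored = censoring_interval(visits, 0.5)    # j = 0
strictly_interval = censoring_interval(visits, 2.5)  # 0 < j < K
right_censored = censoring_interval(visits, 4.0)   # j = K
```

An event before the first visit yields a left-censored interval starting at 0, and an event after the last visit yields a right-censored interval ending at infinity.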

In addition to $(L, R)$, we also observe a $p \times 1$ covariate vector $Z$. We assume that $K$
and the $Y_{K,j}$'s are independent of $(X, Z)$.

The Cox regression model for the survival function at $X = x$ given $Z = z$ is represented
by

$$S(x \mid z) = [S_0(x)]^{e^{z\beta}}, \tag{2}$$

where $z\beta$ is the dot product of $z$ and $\beta$, $S_0(x)$ is a baseline survival function, and $\beta$ is a
$p$-dimensional regression coefficient vector.
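
To make the proportional effect implied by (2) explicit, a short worked derivation may help. It assumes, for illustration only, that $S_0$ is differentiable with baseline hazard $\lambda_0$ and cumulative hazard $\Lambda_0 = -\ln S_0$; the report itself allows $S_0$ to be arbitrary.

```latex
\begin{align*}
\Lambda(x \mid z) &= -\ln S(x \mid z)
                   = e^{z\beta}\,\bigl(-\ln S_0(x)\bigr)
                   = e^{z\beta}\,\Lambda_0(x), \\
\lambda(x \mid z) &= \frac{d}{dx}\,\Lambda(x \mid z)
                   = e^{z\beta}\,\lambda_0(x).
\end{align*}
```

For instance, with a single binary covariate, $\beta = \ln 2$ and $S_0(x) = 0.8$, a subject with $z = 1$ has $S(x \mid 1) = 0.8^{2} = 0.64$, whereas a subject with $z = 0$ has $S(x \mid 0) = 0.8$.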

Let $(L_i, R_i, Z_i)$, $i = 1, \dots, n$, be a random sample of $n$ interval-censored
observations with covariates, and let $I_i$ denote the $i$-th observed interval, with endpoints
$L_i$ and $R_i$. In terms of the original observed intervals, the likelihood
function of $S$ and $b$ is given by

$$L = \prod_{i=1}^{n} \Bigl( (S(L_i))^{e^{z_i b}} - (S(R_i))^{e^{z_i b}} \Bigr), \tag{3}$$

where $S$ is a survival function and $b$ is a $p \times 1$ vector. The GMLE of $(S_0, \beta)$ is
a value $(\hat{S}_0, \hat{\beta})$ that maximizes (3) over all survival functions $S$ and all $b \in \mathbb{R}^p$.

Since $\hat{S}_0$ places all probability mass on the innermost intervals of the $I_i$'s (see Peto
[4] or Turnbull [5]), it is often computationally simpler to express $L$ in terms of innermost
intervals.

We say that an interval $A$ is an innermost interval of the $I_i$'s if $A$ is a nonempty finite
intersection of one or more of the $I_j$'s such that either $I_i \cap A = \emptyset$ or $I_i \cap A = A$ for each
$i$. Suppose there are a total of $m$ distinct innermost intervals $A_i = (\xi_i, \eta_i]$, where $\eta_1 < \eta_2 < \cdots < \eta_m$
and $m \le n$. Then the likelihood function (3) is equivalently given by

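$$L = \prod_{i=1}^{n} \Bigl( \bigl(\textstyle\sum_{k > l_i} s_k\bigr)^{e^{z_i b}} - \bigl(\textstyle\sum_{k > r_i} s_k\bigr)^{e^{z_i b}} \Bigr), \tag{4}$$

where $l_i = \sup\{j : \eta_j \le L_i\}$, $r_i = \sup\{j : \eta_j \le R_i\}$, and $s = (s_1, \dots, s_m)$ denotes the vector
of probability weights. The log likelihood of $(s, b)$ is

$$\mathcal{L}(s, b) = \sum_{i=1}^{n} \ln \Bigl[ \bigl(\textstyle\sum_{k > l_i} s_k\bigr)^{e^{z_i b}} - \bigl(\textstyle\sum_{k > r_i} s_k\bigr)^{e^{z_i b}} \Bigr]. \tag{5}$$

Note that $(\sum_{k > l_i} s_k)^{e^{z_i b}} = 1$ if $l_i = 0$ and $(\sum_{k > r_i} s_k)^{e^{z_i b}} = 0$ if $r_i = m$.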
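
The innermost-interval construction and the log likelihood (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software developed under this grant: observed intervals are treated as half-open $(L_i, R_i]$, the innermost intervals are found by the usual endpoint-sorting argument of Peto [4] and Turnbull [5], and the function names `innermost_intervals` and `log_likelihood` and the toy data are hypothetical.

```python
import bisect
import math


def innermost_intervals(intervals):
    """Return the innermost intervals (xi, eta] of half-open intervals (L, R]."""
    lefts = {left for left, _ in intervals}
    rights = {right for _, right in intervals}
    # Tag right endpoints 0 and left endpoints 1 so that, at a tie, the right
    # endpoint sorts first: (a, b] and (b, c] do not overlap.
    points = sorted([(r, 0) for r in rights] + [(l, 1) for l in lefts])
    # An innermost interval is a left endpoint immediately followed by a
    # right endpoint in this ordering.
    return [(p, q) for (p, tp), (q, tq) in zip(points, points[1:])
            if tp == 1 and tq == 0]


def log_likelihood(s, b, eta, data):
    """Evaluate (5) for weights s, coefficients b and right endpoints eta.

    data is a list of (L_i, R_i, z_i) with z_i a list of p covariate values.
    """
    m = len(s)
    # tail[j] = sum of s_k over k > j (1-based k), so tail[m] = 0.
    tail = [0.0] * (m + 1)
    for j in range(m - 1, -1, -1):
        tail[j] = tail[j + 1] + s[j]
    total = 0.0
    for left, right, z in data:
        l_i = bisect.bisect_right(eta, left)   # number of eta_j <= L_i
        r_i = bisect.bisect_right(eta, right)  # number of eta_j <= R_i
        w = math.exp(sum(zj * bj for zj, bj in zip(z, b)))
        total += math.log(tail[l_i] ** w - tail[r_i] ** w)
    return total


# Toy data: two overlapping intervals and one right-censored observation.
data = [(0.0, 2.0, [0.0]), (1.0, 3.0, [0.0]), (4.0, math.inf, [1.0])]
A = innermost_intervals([(left, right) for left, right, _ in data])
eta = [right for _, right in A]
value = log_likelihood([0.5, 0.5], [math.log(2.0)], eta, data)
```

In the toy data the first two observations share the innermost interval (1, 2], and the right-censored observation contributes the interval ending at infinity, so $m = 2 \le n = 3$. A maximizer would search over `s` on the probability simplex and over `b`; the sketch only evaluates the objective.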

C.2.  Generalized  Maximum  Likelihood  Estimation.

A GMLE of $(s, \beta)$ is a value of $(s, b)$ that maximizes the log likelihood (5). We
could follow the Newton-Raphson (NR) approach taken by Finkelstein [2]. However, this
would involve inverting a matrix of order $(m + p - 1) \times (m + p - 1)$. Since $m$ can be
large when $n$ is large, the unmodified NR algorithm is not feasible for a large
data set.

We advocate a computationally simple approach: first group the original data
$(L_i, R_i)$, and then apply a two-step iterative scheme to obtain the two-step estimators
(TSE) of $S_0$ and $\beta$ based on the innermost intervals corresponding to the grouped intervals.

In the first year of our research, we successfully implemented computer software
to calculate the TSEs of $S_0$ and $\beta$. A manuscript on the two-step computation scheme,
including simulation studies investigating the sensitivity of the estimated values of the TSE to partition
sizes, is ready for submission to a statistical journal ([7]).

In our second year of research, we applied our two-step estimation procedure to
the Cox regression analysis of a long-term prognostic follow-up study involving 375 women
with unilateral T1-2N0, T1-2N1 and T3-4 breast cancer. All the patients were treated at
Memorial Sloan Kettering Cancer Center, and the follow-up is being conducted at Strang
Cancer Prevention Center. The main objective of the study is to assess the prognostic
significance of bone marrow micrometastasis (BMM) in predicting relapse. Standard clinical
variables, including nodal status and tumor diameter, were included in the Cox model.
Although we have not yet established asymptotic normality to validate the P values that
were reported for the study, our two-step Cox regression analysis gave a strong indication that
BMM was not as predictive of relapse as previously expected (Osborne and Wong [6]). We
shall return to the BMM analysis when we have fully established the asymptotic normality of the
GMLE of the Cox regression parameters. In our second year of research, therefore, we
moved ahead of our statement of work by making a start on Task 8. Since the BMM relapse
follow-up study provides a complete and final data set that satisfies our need for
an empirical example to illustrate our asymptotic GML procedure for Cox regression, we
have chosen to focus on this data set instead of the examples mentioned in Task 8.

Also in the second year of our research, we established consistency of the GMLE
of $\beta$ and $S_0$ (and hence of $S(\cdot \mid z)$) under each of the following assumptions:

AS1: $S_0$ is arbitrary and each of the censoring variables $Y_1, \dots, Y_K$ takes on finitely many
values.

AS2: $S_0$ is arbitrary, each of the censoring variables $Y_1, \dots, Y_K$ is continuous, and some
regularity conditions are imposed on either $S_0$ or the joint distribution function $G$ of $K, Y_1,
\dots, Y_K$.




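Specifically, under AS1 or AS2,

$$\Pr\Bigl\{ \lim_{n \to \infty} \hat{\beta} = \beta \Bigr\} = 1, \tag{6}$$

and

$$\Pr\Bigl\{ \lim_{n \to \infty} \sup_{t \in H} |\hat{S}_0(t) - S_0(t)| = 0 \Bigr\} = 1, \tag{7}$$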

where $H$ denotes the support set of $Y_1, \dots, Y_K$. Note that $\hat{S}_0(t)$ is guaranteed to be consistent
for $t \in H$, and not elsewhere. However, the set $H$ is not necessarily a time interval (for
instance, $H$ may be a collection of discrete points). To make the consistency results
more useful, we have established that if $S_0$ is continuous and the support of $Y_1, \dots, Y_K$
is dense in $[0, T]$ for some $T > 0$, then $\hat{S}_0(t)$ is consistent for all $t \in [0, T]$. The practical
implication of the denseness requirement is that pointwise consistency of $\hat{S}_0(t)$ holds
only if all the subjects in a follow-up study are followed at very frequent, closely spaced intervals.

We have also established similar consistency results for the TSE, with the added
assumption that the maximal length of the partition interval tends to 0 as $n$ tends to $\infty$.
These results are summarized in Wong and Yu [7].

Asymptotic normality is the most crucial aspect of our research because it is needed for
making confidence statements and performing hypothesis tests. In the third year of
our research, we investigated asymptotic normality under the following assumptions:

AS3. $S_0$ is arbitrary and satisfies a monotonicity condition, and each of $Y_{K,1}, \dots, Y_{K,K}$ takes
on finitely many values;

AS4. $S_0$ is as in AS3, and each of $Y_{K,1}, \dots, Y_{K,K}$ takes on countably many values;

AS5. $S_0$ is as in AS3, each of $Y_{K,1}, \dots, Y_{K,K}$ is continuous, and some regularity conditions are
imposed on either $S_0$ or $G$.

Asymptotic normality of the GMLE or TSE is straightforward to establish under the finite
assumption AS3. As for AS4 and AS5, we have carried out extensive simulation studies to
guide our research. The studies suggest that both the GMLE and the TSE of $\beta$ and $S_0$ are
asymptotically normal under AS4. Under AS5, however, only the GMLE and TSE of $\beta$ can be
asymptotically normal. We have just completed theoretical proofs to substantiate our
numerical studies. A manuscript is being prepared to report our findings (see Yu and Wong [8]).
Our simulation studies suggest that under AS5 asymptotic inference for the GMLE and TSE of
$S_0$, and hence of $S(\cdot \mid z)$, will have to be accomplished via a bootstrap method. We shall defer
this aspect of research to the fourth and final year of our DOD grant.
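
The report does not specify which bootstrap scheme will be studied. Purely as an illustration of the general idea, the following Python sketch computes a percentile bootstrap interval by resampling subjects with replacement. The function name `percentile_bootstrap_ci`, its parameters, and the stand-in estimator are assumptions made here; in the intended application the estimator would map a resampled list of $(L_i, R_i, Z_i)$ triples to an estimate such as the value of $\hat{S}_0$ at a fixed time point, and whether such naive resampling is valid under AS5 is part of the deferred research.

```python
import random


def percentile_bootstrap_ci(sample, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for estimator(sample).

    sample: list of subject-level records, resampled with replacement.
    estimator: function mapping a list of records to a single number.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    stats = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        stats.append(estimator(resample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def mean(values):
    # Stand-in estimator used only to exercise the resampling loop.
    return sum(values) / len(values)


ci = percentile_bootstrap_ci([1.0, 2.0, 3.0, 4.0, 5.0], mean, n_boot=500)
```

Resampling whole subjects keeps each interval together with its covariate vector, which is the natural unit when the records are $(L_i, R_i, Z_i)$ triples.
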

Cox regression is appropriate only if the proportional hazards (PH) assumption is satisfied
by the data. Under the PH assumption, the log-rank test is most powerful. At present,
a statistically useful diagnostic plot for the PH assumption is lacking. Moreover, a formal
significance test is not available. In the third year of our research, we developed statistical
solutions to meet both these needs. We are preparing a manuscript to report this particular
piece of research, which was not proposed in the original statement of work (see Wong and
Yu [9]).


D.  KEY  RESEARCH  ACCOMPLISHMENTS 

•  We have implemented a statistical algorithm for computing the GMLE of the regression
coefficients $\beta$ and the baseline survival function $S_0$.

•  We have implemented a statistical algorithm for computing the TSE of $\beta$ and $S_0$.

•  Computer programs for both GMLE and TSE calculations have been made available
to the public via the internet.

•  We have proved consistency of the GMLE and TSE of $\beta$ and $S_0$ under both discrete and
continuous assumptions about the censoring distribution $G$.

•  We have performed extensive simulation studies to investigate the asymptotic properties
of the GMLE and TSE of $\beta$ and $S_0$. Our results have provided strong evidence that $\hat{S}_0$ is
NOT asymptotically normal when $G$ is continuous.

•  We have derived the asymptotic normal means and covariance matrices of the GMLE and
TSE of $\beta$.

•  When $G$ is finite or countably infinite, we have derived the asymptotic means and
covariance matrices of the GMLE and TSE of $S_0$.

•  We have proposed a diagnostic plot for checking the proportional hazards assumption for
Cox regression and constructed a chi-square test for assessing this assumption.

•  We have begun asymptotic GML Cox regression analysis of a long-term breast cancer
follow-up study assessing the prognostic significance of bone marrow micrometastasis
in predicting relapse in a cohort of 375 women.

E.  REPORTABLE  OUTCOMES 

•  An oral presentation of an abstract at the 2002 ASCO Meeting ([6]).

•  An abstract published in the 2002 ASCO proceedings ([6]).

•  A manuscript on computation of the GMLE and TSE of Cox regression parameters ([7]).

•  A manuscript on consistency and asymptotic normality of the GMLE and TSE ([8]).

•  A manuscript on assessing the appropriateness of the proportional hazards assumption for
Cox regression ([9]).

•  Computer programs for calculating the GMLE and TSE, made available to the public via
the internet site http://www.math.binghamton.edu/qyu/index.html.

F.  CONCLUSIONS 

In the three years of our DOD grant, we have successfully accomplished most of our
research objectives in developing asymptotic generalized maximum likelihood inference for the
Cox proportional hazards regression model. We have developed statistical algorithms that
can efficiently compute the GMLE and TSE of the regression coefficients $\beta$ and the baseline
survival function $S_0$ for any reasonable sample size. We have proved consistency of the GMLE
and TSE of $\beta$ and $S_0$ under both discrete and continuous assumptions about the censoring
distribution $G$. We have established asymptotic normality of the GMLE and TSE of $\beta$ for $G$
unrestricted. When $G$ is continuous, we have numerically demonstrated that the GMLE and
TSE of $S_0$ are not asymptotically normal. We propose to complete Task 5(c) and Task 6(c)
in the fourth and final year of our DOD grant by investigating a bootstrap method for the
asymptotic interval estimates of $S_0$.

Cox regression is appropriate only if the proportional hazards (PH) assumption is satisfied
by the data. We have proposed a useful diagnostic plot for the PH assumption and validated a
chi-square test for it.

In the fourth and final year of research, we shall complete computer software for
asymptotic confidence intervals and hypothesis testing for the GMLE and TSE of $\beta$ and $S(\cdot \mid z)$
(Task 6(c)). We shall also complete the data analysis of the BMM prognostic study.

The results which we have established will be useful to breast cancer researchers pursuing
chemoprevention intervention trials involving surrogate endpoint biomarkers, and to
genetic epidemiologists conducting studies on familial aggregation of breast cancer and related
cancers.




G.  REFERENCES 

[1] Cox, D.R. (1972). Regression models and life tables. J. Roy. Statist. Soc. Ser. B, 34,
187-220.

[2] Finkelstein, D.M. (1986). A proportional hazards model for interval-censored failure
time data. Biometrics, 42, 845-854.

[3] Wong, G.Y., Bradlow, H.L., Sepkovic, D., Mehl, S., Mailman, J. and Osborne, M.P. (1997).
A dose-ranging study of indole-3-carbinol for breast cancer prevention. Journal of
Cellular Biochemistry Supplement, 28/29, 111-116.

[4] Peto, R. (1973). Experimental survival curves for interval-censored data. Appl. Statist.,
22, 86-91.

[5] Turnbull, B.W. (1976). The empirical distribution function with arbitrary grouped,
censored and truncated data. J. Roy. Statist. Soc. Ser. B, 38, 290-295.

[6] Osborne, M.P. and Wong, G.Y.C. (2002). Breast cancer bone marrow micrometastases:
a long-term prognostic study of systemic tumor cell burden on relapse. Proceedings of
American Society of Clinical Oncology, 21, #228.

[7] Wong, G.Y.C. and Yu, Q.Q. (2003). Estimation under the Cox regression model with
interval-censored data. (Under preparation).

[8] Yu, Q.Q. and Wong, G.Y.C. (2003). Asymptotic properties of the GMLE under the Cox
regression model with interval-censored data. (Under preparation).

[9] Wong, G.Y.C. and Yu, Q.Q. (2003). A test for checking Cox's model with a dichotomous
covariate. (Under preparation).




